hpuxcomp.txt | Text File | 1993-03-29
The Power of HP-UX Compiler Optimizations
1.0 Abstract
Any code compiled without optimization may be missing a great opportunity for
performance improvement. A study of the effect of compiler optimization on the
benchmarks in the SPEC CINT92 suite shows that, compared to unoptimized code,
an easy 15 to 25% gain is available, and performance gains of 25 to 60% may be
available to those who utilize more of the optimizer. Note that the benchmarks
studied are not highly tunable, easily optimized floating-point benchmarks;
they are C-language, integer benchmarks which are much like many of the HP-UX
applications.
2.0 Executive Summary
While the exact effect varied from benchmark to benchmark, just turning on
Level One (+O1), which should be safe on virtually all code, generally gained
about 20% faster execution. Using Level Two (-O or +O2), the standard level of
optimization, generally gained up to a 50% performance increase. And when
every available feature was utilized, the gains were usually in the 40 - 60%
range, with one benchmark that did better than a two-fold increase in
performance.
It is well worth investigating the use of optimization for any code compiled
on an HP-UX 9.0 or later system, even if optimization was not considered
important for earlier releases. The 9.0 and later compilers trade simpler
code generation for faster compile and link times when no optimization is
specified. Thus with 9.0, if no optimization is used, it is very likely that
there will be a 10% performance drop even before any kernel or system impacts
are considered.
3.0 Observations
The use of just Level One optimization is virtually a free 20% performance
gain over not using any optimization. Level One optimization applies only the
simple, in-line, statement-level, safe optimizations which can be used by
just about any piece of code. Still, these easy optimizations provided between
15 and 25% improvement in most cases in the CINT92 suite.
Level Two optimization, the standard level of optimization (-O), offers another
performance gain ranging from 5 - 10% for "branchy" system-type code to
20 - 30% for more "loopy" analytic-style code. There are benefits for almost
all programs in Level Two optimization, including a nice reduction in the
overhead of procedure calling, but the big gains will be in those pieces of
code which spend a fair amount of time in a few key loops. For the one case,
out of the six in CINT92, that best exemplifies this "loopy" behavior, Level
Two brought almost a two-fold performance improvement over Level One.
At this point, there was only a small gain from attempting Level Three
optimizations on the examples in the CINT92 suite. There seems to be less
advantage in the advanced optimization methods for this kind of system
application C code. However, Level Three did have a significant impact on the
results of CFP92, the floating-point suite, whose benchmarks are much more
analytic in nature.
However, there is another optimization that is well worth considering. The use
of archive libraries, rather than the default shared libraries, can make a
noticeable difference. Even code which rarely made any library calls showed a
detectable 1 - 2% difference when linked shared. But code which makes a high
number of calls to library routines may see as much as a 20% improvement with
the use of archived executables rather than dynamically linked shared
libraries.
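As a sketch, linking the same object file both ways might look like the
Makefile fragment below. The `-Wl,-a,archive` spelling is an assumption about
the HP-UX linker's archive/shared library selector, so check ld(1) on the
target release before relying on it.

```make
# Hypothetical Makefile fragment; flag spellings are assumptions
# about the HP-UX 9.0 toolchain, not verified against it.
CFLAGS = +O2

prog_shared: prog.o
	cc -o prog_shared prog.o                  # default: shared libraries

prog_archive: prog.o
	cc -o prog_archive prog.o -Wl,-a,archive  # force archive libraries
```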
4.0 Results
4.1 Result Summary
Across this variety of benchmarks, the HP-UX 9.0 compilers average better than
a 20% improvement from Level One (+O1) optimization alone. Turning on Level Two
generates an average improvement of almost 60%. The use of archived libraries
rather than shared libraries offers an additional 8% performance. This gives a
68% total improvement over the defaults, all of which is available without
touching any source code.
Table 1: Relative Speedups with Various Levels of Optimization
    SPECint92   none   +O1    +O2    +O3    +O3 archived
    008         1.00   1.26   1.59   1.59   1.66
    022         1.00   1.25   1.52   1.54   1.57
    023         1.00   1.38   2.62   2.62   2.77
    026         1.00   1.18   1.40   1.42   1.46
    072         1.00   1.05   1.11   1.11   1.30
    085         1.00   1.17   1.24   1.24   1.34
    Average     1.00   1.21   1.58   1.59   1.68
4.2 Individual Benchmarks
The individual results show a fair amount of variation, reflecting the variety
in the six members of the CINT92 suite. SPEC recommends that people not rely
upon the summary metrics, but rather determine which benchmarks match their
interests and look carefully at those. With that in mind, the following
short descriptions might be helpful in reviewing these results.
o Espresso and Eqntott are both from scientific applications and should
show the most improvement from compiler optimizations.
o Li, as a Lisp interpreter recursing and backtracking through the
N-Queens problem, would showcase the procedure calling overhead.
o Compress and SC are much like many UN*X commands.
o GCC represents large system applications.
4.2.1 085.gcc
The code in the 085.gcc benchmark is version 1.35 of the GNU C Compiler. The
benchmark measures the time it takes a system to use this compiler to generate
several executables for the old Sun-3 workstations.
The GNU compiler is over 50,000 lines of C code (not counting comments,
blank space, etc.), much of which is fairly representative of large system
applications. Because of its size and its behavior, which is typical of large
software projects, this benchmark has been used by many as a predictor for
performance of both kernel and large application behavior.
There was a 17% improvement going from none to just Level One optimization.
There was almost a 25% improvement going from none to Level Two optimization.
Additionally, the use of archive libraries rather than shared libraries
obtained another 10% improvement. All this in typical system-level code, which
is supposed to render optimizers ineffective.
4.2.2 072.sc
This is the public domain UN*X spreadsheet program sc(1) run against three
different inputs: a mortgage cost calculation, a small budget calculation, and
a SPECmark89 result calculation. The program sc(1) makes great use of the
curses(3) package to do the screen handling and give a Lotus 1-2-3 look. The
benchmark enforces a common vt220 terminal definition for the curses(3)
handling no matter what the underlying system, so that all systems are doing
the same work.
This benchmark was the one in the suite which showed the least improvement, but
then 072.sc is the benchmark which spends a very large fraction of its time in
library routines: 80%. All of that time spent in the libraries is unaffected by
any compiler optimizations. Even with spending only 1/5th of its time in the
code that the compilers can have an effect upon, Level One optimization got a
gain of 5%, and Level Two brought the advantage up over 10%. This implies that
the optimizer made differences of 25 and 50% in the code that it operated upon.
However, with all the calls to the library routines, this is the benchmark
which gains the most from using archived instead of shared libraries: almost
20%. Thus, it is very important that programs which make a lot of calls to
library routines be linked from the archived libraries rather than from the
shared libraries. Taken together, even this benchmark can run 30% faster than
with the defaults.
4.2.3 026.compress
The 026.compress benchmark is from a public-domain version of the UN*X
compress(1) utility using the Lempel-Ziv algorithm. This benchmark reads its
standard input, which is a 1 MB file of random text, and writes a compressed
form to the standard output; then the compressed file is fed back through
standard IO to be uncompressed. The code loops through reading in a block of
data, computing its compressed form, and then writing that out. This makes for
a typical UN*X filter, though perhaps with a bit higher ratio of compute to IO.
Perhaps owing to the fair bit of computation per IO buffer, this benchmark
gains almost 20% with just Level One optimization. This performance gain is
doubled when Level Two optimization is enabled. Here again, the amount of
computation per buffer, and in particular the looping nature of that
computation, allows the optimizer to work well. Not surprisingly, for a
benchmark which makes comparatively few library calls, the difference between
shared and archived libraries was not significant. But again, the total
increase was over 45% better than the default settings.
4.2.4 023.eqntott
One of the most analytical of the CINT92 suite, 023.eqntott is based on a tool
from the area of logic design which translates boolean equations into truth
tables. This integer benchmark is probably the most like the classic floating
point benchmarks: lots of activity inside loops, not a lot of unexpected
branches. Therefore, it is the most likely benchmark to highlight the effect
of the optimizer's capabilities.
As expected, this benchmark shows considerable improvement even from just Level
One optimization: 38%. Then, like most scientific applications, Level Two
optimization has a great effect; in this case it approaches a 3-fold increase
in performance. And again, even after a considerable speed-up, the use of
archived libraries adds a noticeable gain.
4.2.5 022.li
This is the benchmark in the CINT92 suite which most exercises the procedure
calling convention, 022.li is a Lisp interpreter running a code which attempts
to solve the 9-Queens problem with a recursive backtracking algorithm.
Again, on this benchmark, Level One gains a 25% performance increase. Level
Two optimization doubles this gain for a total of over a 50% increase. Once
again, just using the default settings leaves out a considerable amount of
performance.
4.2.6 008.espresso
This benchmark is taken from another logic design application; in this case the
application generates and optimizes Programmable Logic Arrays. This is another
of the more computational benchmarks.
In this case, even Level One gets over 25% improvement. And, one more time, the
use of Level Two does better than double the advantage, to almost 60% better
than the basic compilations. On top of that, linking with archived rather than
shared libraries gains several more percentage points, to where this benchmark
measures 66% better fully optimized than it measured under default conditions.
4.3 8.02 Results
There were a lot of enhancements made to the HP-UX 9.0 compilers, but even
so, with HP-UX 8.0 it was still worth a fair bit to turn on the optimizer. As
detailed in the table, Level One gained over 15%, and Level Two gains that
much again, for a total possible improvement of over 30%.
Table 2: HP-UX 8.02: Relative Speedups of Various Levels of Optimization
    SPECint92   none   +O1    +O2    +O3    +O3 archived
    008         1.00   1.23   1.41   1.41   1.46
    022         1.00   1.16   1.31   1.31   1.30
    023         1.00   1.34   1.74   1.73   1.81
    026         1.00   1.08   1.24   1.24   1.25
    072         1.00   1.03   1.06   1.06   1.15
    085         1.00   1.14   1.18   1.18   1.23
    Average     1.00   1.16   1.32   1.32   1.37
Most important, however, are the differences in the compilers between 8.02
and 9.0. Starting in 9.0, the compilers use long branches exclusively
unless the optimizer is activated. This makes the work of the linker much
easier, resulting in much faster link times. The effect of this is that code
compiled without optimization will run almost 10% slower under 9.0 than under
8.02, without even considering what the kernel and the rest of the system do
to the performance.
Table 3: 9.0 versus 8.02: Relative Speeds at Various Levels of Optimization
    SPECint92   none   +O1    +O2    +O3    +O3 archived
    008         0.75   0.76   0.84   0.84   0.85
    022         0.89   0.96   1.04   1.05   1.07
    023         0.82   0.84   1.24   1.24   1.26
    026         0.91   1.00   1.03   1.05   1.07
    072         1.17   1.19   1.22   1.22   1.32
    085         0.91   0.94   0.96   0.96   1.00
    Average     0.91   0.95   1.05   1.06   1.09
5.0 Conclusion
Compiler optimization technology unlocks the performance potential of the
PA-RISC architecture. The RISC philosophy has fundamentally changed the role of
the compiler. As RISC moves to strike a balance between hardware and software
that exploits the best of each technology, the resulting simple,
high-performance instruction set enables the compiler to apply optimizations
that dramatically improve performance. The effectiveness of RISC depends on
the compiler's ability to create the optimal instruction sequence by
appropriately rearranging the program steps. Without these optimizations, many
applications will execute at a performance level far below their potential.